Skip to content

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Sep 1, 2025

Which issue does this PR close?

Closes #2292

Rationale for this change

Increase coverage of TPC-H benchmark.

What changes are included in this PR?

Add support for COUNT(DISTINCT expr)

How are these changes tested?

@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 42.30769% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.43%. Comparing base (f09f8af) to head (6e124b4).
⚠️ Report is 449 commits behind head on main.

Files with missing lines Patch % Lines
...main/java/org/apache/comet/vector/CometVector.java 0.00% 15 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2273      +/-   ##
============================================
+ Coverage     56.12%   57.43%   +1.30%     
- Complexity      976     1287     +311     
============================================
  Files           119      146      +27     
  Lines         11743    13387    +1644     
  Branches       2251     2377     +126     
============================================
+ Hits           6591     7689    +1098     
- Misses         4012     4431     +419     
- Partials       1140     1267     +127     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@comphead
Copy link
Contributor

comphead commented Sep 2, 2025

That looks incredibly useful!

@andygrove andygrove changed the title feat: Add support for COUNT(DISTINCT expr) feat: [WIP] Add support for COUNT(DISTINCT expr) Sep 3, 2025
@andygrove
Copy link
Member Author

I have test failures with columnar shuffle trying to call unimplemented method getLong on CometListVector. I suspect that this is related to writing shuffle output from a partial aggregate. I am putting this on hold for now and perhaps try and get this into the 0.11.0 release

@andygrove
Copy link
Member Author

I understand the issue now. For COUNT(DISTINCT bool_col) the partial count outputs all of the distinct values for bool_col so we have a boolean list vector containing [[true], [false]]. Columnar shuffle writer does not understand this and thinks that the schema is a single LongType representing the final output from the count.

@andygrove
Copy link
Member Author

I understand the issue now. For COUNT(DISTINCT bool_col) the partial count outputs all of the distinct values for bool_col so we have a boolean list vector containing [[true], [false]]. Columnar shuffle writer does not understand this and thinks that the schema is a single LongType representing the final output from the count.

If I fall back to Spark for the shuffle, I see the same issue. So the problem is perhaps that we report the wrong schema as the output from the partial aggregate

@andygrove
Copy link
Member Author

I learned a lot from this WIP PR and have documented my findings in #2292 so I will close this for now.

@andygrove andygrove closed this Sep 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for COUNT(DISTINCT expr)
3 participants